textual input
Robust Visual Reasoning via Language Guided Neural Module Networks
Neural module networks (NMN) are a popular approach for solving multi-modal tasks such as visual question answering (VQA) and visual referring expression recognition (REF). A key limitation in prior implementations of NMN is that the neural modules do not effectively capture the association between the visual input and the relevant neighbourhood context of the textual input.
Robust Visual Reasoning via Language Guided Neural Module Networks
Neural module networks (NMN) are a popular approach for solving multi-modal tasks such as visual question answering (VQA) and visual referring expression recognition (REF). A key limitation in prior implementations of NMN is that the neural modules do not effectively capture the association between the visual input and the relevant neighbourhood context of the textual input.
Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs
Various developed countries have experienced a continuous rise in obesity [1]. This can be burdensome on healthcare and has thus prompted governments to introduce laws and regulations to restaurants and food chains in an attempt to promote healthier eating [1]. One patent rule is the mandatory display of calories in menus, which has not come without an added cost to businesses. For example, Section 4205 of the Affordable Care Act (ACA) in the USA required certain food chains to display caloric information on menu items which was estimated to cost $315 million to comply [2]. These costs were primarily due to nutritional analysis - if one could drastically diminish this overhead, it would greatly lower business spending [2].
When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Wang, Cheng, Deng, Gelei, Yang, Xianglin, Qiu, Han, Zhang, Tianwei
Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, and explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance the robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
PA-RNet: Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting
Liu, Chanjuan, Wang, Shengzhi, Zhu, Enqiang
In real-world applications, multimodal time series data often suffer from interference, especially in the textual modality. Existing methods for multimodal time series forecasting often neglect the inherent perturbations within textual data, where irrelevant, noisy, or ambiguous content can significantly degrade model performance, particularly when the noise exhibits varying intensity or stems from structural inconsistencies. To address this challenge, we propose PA-RNet (Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting), a robust multimodal forecasting framework. PA-RNet features a perturbation-aware projection module and a cross-modal attention mechanism to effectively separate noise from the textual embeddings while maintaining semantically meaningful representations, thereby enhancing the model's generalization ability. Theoretically, we establish the Lipschitz continuity of PA-RNet with respect to textual inputs and prove that the proposed perturbation module can reduce expected prediction error, offering strong guarantees of stability under noisy conditions. Furthermore, we introduce a textual perturbation pipeline that can be seamlessly incorporated into existing multimodal time series forecasting tasks, allowing for systematic evaluation of the model's robustness in the presence of varying levels of textual noise. Extensive experiments across diverse domains and temporal settings demonstrate that PA-RNet consistently outperforms state-of-the-art baselines.
Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models -- all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.
MPL: Multiple Programming Languages with Large Language Models for Information Extraction
Li, Bo, Fang, Gexiang, Ye, Wei, Xu, Zhenghua, Zhang, Jinglei, Cheng, Hao, Zhang, Shikun
Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose \textbf{M}ultiple \textbf{P}rogramming \textbf{L}anguages with large language models for information extraction (abbreviated as \textbf{MPL}), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce \texttt{function-prompt} with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.
Robust Visual Reasoning via Language Guided Neural Module Networks
Neural module networks (NMN) are a popular approach for solving multi-modal tasks such as visual question answering (VQA) and visual referring expression recognition (REF). A key limitation in prior implementations of NMN is that the neural modules do not effectively capture the association between the visual input and the relevant neighbourhood context of the textual input. For instance, NMN fail to understand new concepts such as "yellow sphere to the left" even when it is a combination of known concepts from train data: "blue sphere", "yellow cube", and "metallic cube to the left". In this paper, we address this limitation by introducing a language-guided adaptive convolution layer (LG-Conv) into NMN, in which the filter weights of convolutions are explicitly multiplied with a spatially varying language-guided kernel. Our model allows the neural module to adaptively co-attend over potential objects of interest from the visual and textual inputs.
WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs
Yang, Deshun, Hu, Luhui, Tian, Yu, Li, Zihao, Kelly, Chris, Yang, Bang, Yang, Cindy, Zou, Yuexian
Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, it remains a formidable challenge pertaining to maintaining temporal consistency and ensuring action smoothness throughout the generated sequences. In this paper, we present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build skilled world models framework based on textual prompts and accompanying images. The framework includes two parts: prompt enhancer and full video translation. The first part employs the capabilities of Chat-GPT to meticulously distill and proactively construct precise prompts for each subsequent step, thereby guaranteeing the utmost accuracy in prompt communication and accurate execution in following model operations. The second part employ compatible with existing advanced diffusion techniques to expansively generate and refine the key frame at the conclusion of a video. Then we can expertly harness the power of leading and trailing key frames to craft videos with enhanced temporal consistency and action smoothness. The experimental results confirm that our method has strong effectiveness and novelty in constructing world models from text and image inputs over the other methods.